Using Decision Trees and Support Vector Machines to Classify Genes by Names
Abstract
In this paper we report an application of machine learning methods to classify gene names into two categories: known and unknown. We acquired a data set of 1,624 genes by having a human expert classify them manually. To capture the knowledge behind this classification, we also asked the expert to derive a set of rules. In parallel, we trained two machine learners to capture the same knowledge. Both decision trees (CART) and Support Vector Machines (SVMs) outperform the expert rules: the cross-validated error rates are below 1%, and the areas under the Receiver Operating Characteristic (ROC) curves exceed 0.99. Compared with classification using the expert rule set, CART and SVMs reduced the overall prediction error rate by 40% and 88%, respectively. In addition, the machine classifiers were able to find some errors made by the human expert himself. Finally, we used the expert system to classify 7,447 genes on the Affymetrix U74A microarray chip. The results show that 70% of the genes on this chip are known. In conclusion, we demonstrate that the machine-derived classifiers handle the task more effectively than the expert-derived classifier. This further supports the idea that in many application domains experts can perform a task but cannot explain how they do it, whereas expert systems are able to capture that knowledge from the experts.
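The two evaluation figures quoted in the abstract, relative error-rate reduction and area under the ROC curve, can be illustrated with a minimal sketch. The function names and all numbers below are hypothetical toy values chosen for illustration; they are not the paper's data, and this is not the authors' evaluation code.

```python
def relative_error_reduction(baseline_error, model_error):
    """Fraction by which model_error is lower than baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - model_error) / baseline_error


def roc_auc(scores, labels):
    """Area under the ROC curve, computed as the probability that a
    randomly chosen positive example scores higher than a randomly
    chosen negative one (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (assumed values): a baseline error of 5% against a model
# error of 3% corresponds to a 40% relative reduction.
toy_reduction = relative_error_reduction(5.0, 3.0)

# Toy classifier scores with 1 = "known" gene and 0 = "unknown" gene.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_auc = roc_auc(toy_scores, toy_labels)
```

A relative reduction is measured against the baseline error, so an 88% reduction means the remaining error is 12% of the baseline, not that accuracy rose by 88 percentage points.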
Publication date: 2003